This report explores a dataset containing prices and attributes for approximately 22,000 houses. The tidy data was taken from https://www.kaggle.com/harlfoxem/housesalesprediction. The information about house sales originally comes from official public records. We work here with the part of the dataset that covers homes sold between May 2014 and May 2015 in King County, Washington State.
## [1] 21613 16
## 'data.frame': 21613 obs. of 16 variables:
## $ date : Factor w/ 372 levels "20140502T000000",..: 165 221 291 221 284 11 57 252 340 306 ...
## $ price : num 221900 538000 180000 604000 510000 ...
## $ bedrooms : int 3 3 2 4 3 4 3 3 3 3 ...
## $ bathrooms : num 1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
## $ sqft_living : int 1180 2570 770 1960 1680 5420 1715 1060 1780 1890 ...
## $ sqft_lot : int 5650 7242 10000 5000 8080 101930 6819 9711 7470 6560 ...
## $ waterfront : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
## $ view : Ord.factor w/ 5 levels "0"<"1"<"2"<"3"<..: 1 1 1 1 1 1 1 1 1 1 ...
## $ condition : Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 3 3 3 5 3 3 3 3 3 3 ...
## $ yr_built : int 1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 ...
## $ yr_renovated : int 0 1991 0 0 0 0 0 0 0 0 ...
## $ month : Ord.factor w/ 12 levels "1"<"2"<"3"<"4"<..: 10 12 2 12 2 5 6 1 4 3 ...
## $ day : Ord.factor w/ 31 levels "1"<"2"<"3"<"4"<..: 13 9 25 9 18 12 27 15 15 12 ...
## $ formatted_date : Ord.factor w/ 13 levels "14 May"<"14 Jun"<..: 6 8 10 8 10 1 2 9 12 11 ...
## $ price_thousands : num 222 538 180 604 510 ...
## $ yr_build_or_renovated: int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
The data frame shown above consists of 16 variables (including derived ones such as month, day and price_thousands), with 21,613 observations.
## date price bedrooms
## 20140623T000000: 142 Min. : 75000 Min. : 0.000
## 20140625T000000: 131 1st Qu.: 321950 1st Qu.: 3.000
## 20140626T000000: 131 Median : 450000 Median : 3.000
## 20140708T000000: 127 Mean : 540088 Mean : 3.371
## 20150427T000000: 126 3rd Qu.: 645000 3rd Qu.: 4.000
## 20150325T000000: 123 Max. :7700000 Max. :33.000
## (Other) :20833
## bathrooms sqft_living sqft_lot waterfront view
## Min. :0.000 Min. : 290 Min. : 520 0:21450 0:19489
## 1st Qu.:1.750 1st Qu.: 1427 1st Qu.: 5040 1: 163 1: 332
## Median :2.250 Median : 1910 Median : 7618 2: 963
## Mean :2.115 Mean : 2080 Mean : 15107 3: 510
## 3rd Qu.:2.500 3rd Qu.: 2550 3rd Qu.: 10688 4: 319
## Max. :8.000 Max. :13540 Max. :1651359
##
## condition yr_built yr_renovated month day
## 1: 30 Min. :1900 Min. : 0.0 5 :2414 23 : 906
## 2: 172 1st Qu.:1951 1st Qu.: 0.0 4 :2231 9 : 808
## 3:14031 Median :1975 Median : 0.0 7 :2211 5 : 807
## 4: 5679 Mean :1971 Mean : 84.4 6 :2180 24 : 801
## 5: 1701 3rd Qu.:1997 3rd Qu.: 0.0 8 :1940 20 : 787
## Max. :2015 Max. :2015.0 10 :1878 16 : 759
## (Other):8759 (Other):16745
## formatted_date price_thousands yr_build_or_renovated
## 15 Apr :2231 Min. : 75.0 Min. :2015
## 14 Jul :2211 1st Qu.: 321.9 1st Qu.:2015
## 14 Jun :2180 Median : 450.0 Median :2015
## 14 Aug :1940 Mean : 540.1 Mean :2015
## 14 Oct :1878 3rd Qu.: 645.0 3rd Qu.:2015
## 15 Mar :1875 Max. :7700.0 Max. :2015
## (Other):9298
The plots above show the original and the transformed data. Because the original price distribution has a long right tail, a logarithmic transformation was applied to make the distribution easier to read. The transformed price distribution appears approximately normal, with a peak around 450 thousand dollars. Now let’s see how the distribution looks across the categorical variables waterfront, view and condition.
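To illustrate why the log scale helps, here is a minimal sketch in base R. It uses synthetic log-normal prices, not the King County data, and the helper `skewness` is my own illustrative function (base R has none built in).

```r
# Synthetic stand-in for prices in thousands of dollars (NOT the real data).
set.seed(1)
price_thousands <- rlnorm(5000, meanlog = log(450), sdlog = 0.5)

# Simple moment-based skewness (hypothetical helper for this sketch).
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

# Skewness before and after the log10 transformation.
raw_skew <- skewness(price_thousands)
log_skew <- skewness(log10(price_thousands))

# Bin counts for both scales; pass plot = TRUE to draw the histograms.
raw_hist <- hist(price_thousands, breaks = 50, plot = FALSE)
log_hist <- hist(log10(price_thousands), breaks = 50, plot = FALSE)
```

A skewness near zero on the log scale is what "approximately normal" would look like numerically; on the raw scale the long right tail pushes it well above zero.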
As we can see, most houses do not overlook the waterfront and have a view graded as zero. Condition is skewed to the right, with most houses in condition 3 or better. It is also interesting to look at the distribution of the day and month when a house was sold.
The month data has two peaks, in July 2014 and April 2015, and a significantly lower count of houses sold during winter. The low count for May 2015 can be explained by the fact that the data does not cover the full month. The distribution of days looks uniform, with a significantly lower value for the 31st. This is explained by the absence of a 31st in about half of the months, so no surprise here.
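The point about the 31st can be checked with a quick calendar count. This sketch assumes the sale window runs over every calendar day from 1 May 2014 to 31 May 2015; the exact first and last sale dates are an assumption on my side.

```r
# Assumed sale window (the exact boundaries are an assumption).
days <- seq(as.Date("2014-05-01"), as.Date("2015-05-31"), by = "day")

# How many times does each day-of-month occur in that window?
day_of_month <- as.integer(format(days, "%d"))
occurrences <- table(day_of_month)

occurrences[c("1", "29", "30", "31")]
```

Under this assumption days 1 to 28 occur 13 times each, while the 31st occurs only 8 times. If the data stop before 31 May 2015, as the text suggests, the count for the 31st would be one lower still.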
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 3.000 3.000 3.371 4.000 33.000
##
## 0 1 2 3 4 5 6 7 8 9 10 11 33
## 13 199 2760 9824 6882 1601 272 38 13 6 3 1 1
The smallest houses have 0 bedrooms and the largest one has 33 bedrooms (wow!). The plots above exclude the house with 33 bedrooms as it’s clearly an outlier. All houses have a whole number of bedrooms, and houses with 3 bedrooms are the most common.
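As a small sanity check, the bedrooms column can be rebuilt from the frequency table printed above. The sketch below only uses the counts from that table; the variable names are mine.

```r
# Rebuild the bedrooms column from the frequency table printed above.
bedroom_values <- c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 33)
bedroom_counts <- c(13, 199, 2760, 9824, 6882, 1601, 272, 38, 13, 6, 3, 1, 1)
bedrooms <- rep(bedroom_values, times = bedroom_counts)

summary(bedrooms)  # should agree with the summary shown above

# Drop the single 33-bedroom house before plotting, as described in the text.
bedrooms_plot <- bedrooms[bedrooms < 33]

# Most common number of bedrooms.
most_common <- as.integer(names(which.max(table(bedrooms))))
```

Removing the one outlier leaves a maximum of 11 bedrooms, which keeps the histogram axis readable.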
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.750 2.250 2.115 2.500 8.000
##
## 0 0.5 0.75 1 1.25 1.5 1.75 2 2.25 2.5 2.75 3 3.25 3.5 3.75
## 10 4 72 3852 9 1446 3048 1930 2047 5380 1185 753 589 731 155
## 4 4.25 4.5 4.75 5 5.25 5.5 5.75 6 6.25 6.5 6.75 7.5 7.75 8
## 136 79 100 23 21 13 10 4 6 2 2 2 1 1 2
The maximum number of bathrooms is 8 and all bathroom values are multiples of 0.25. Houses with 2.5 bathrooms are the most common. What is suspicious is that there are ten houses with zero bathrooms. Let’s look at these houses more closely.
## date price bedrooms bathrooms sqft_living sqft_lot
## 876 20140612T000000 1095000 0 0 3064 4764
## 1150 20150217T000000 75000 1 0 670 43377
## 3120 20150205T000000 380000 0 0 1470 979
## 5833 20141104T000000 280000 1 0 600 24501
## 6995 20140624T000000 1295650 0 0 4810 28008
## 9774 20150429T000000 355000 0 0 2460 8049
## 9855 20141223T000000 235000 0 0 1470 4800
## 10482 20140918T000000 484000 1 0 690 23244
## 14424 20150413T000000 139950 0 0 844 4269
## 19453 20140926T000000 142000 0 0 290 20875
## waterfront view condition yr_built yr_renovated month day
## 876 0 2 3 1990 0 6 12
## 1150 0 0 3 1966 0 2 17
## 3120 0 2 3 2006 0 2 5
## 5833 0 0 2 1950 0 11 4
## 6995 0 0 3 1990 0 6 24
## 9774 0 0 3 1990 0 4 29
## 9855 0 0 3 1996 0 12 23
## 10482 0 0 4 1948 0 9 18
## 14424 0 0 4 1913 0 4 13
## 19453 0 0 1 1963 0 9 26
## formatted_date price_thousands yr_build_or_renovated
## 876 14 Jun 1095.00 2015
## 1150 15 Feb 75.00 2015
## 3120 15 Feb 380.00 2015
## 5833 14 Nov 280.00 2015
## 6995 14 Jun 1295.65 2015
## 9774 15 Apr 355.00 2015
## 9855 14 Dec 235.00 2015
## 10482 14 Sep 484.00 2015
## 14424 15 Apr 139.95 2015
## 19453 14 Sep 142.00 2015
I guess it is what it is: some houses don’t have bathrooms. Other information about those houses looks valid.
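A minimal sketch of how such rows can be pulled out and how the "multiples of 0.25" observation can be checked. The small data frame `kc_sample` holds a few rows copied from the outputs above; the real analysis would run the same calls on the full `kc_house` data frame.

```r
# A few rows copied from the outputs above (not the full data set).
kc_sample <- data.frame(
  price       = c(221900, 538000, 1095000, 75000, 142000),
  bedrooms    = c(3, 3, 0, 1, 0),
  bathrooms   = c(1, 2.25, 0, 0, 0),
  sqft_living = c(1180, 2570, 3064, 670, 290)
)

# Houses with no bathrooms recorded.
no_bath <- subset(kc_sample, bathrooms == 0)

# Check that every bathroom count is a multiple of 0.25.
all_quarter_steps <- all(kc_sample$bathrooms %% 0.25 == 0)
```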
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 290 1427 1910 2080 2550 13540
The plots above show the original and the transformed data. Because the original distribution has a long right tail, a logarithmic transformation was applied to make the living area distribution easier to read. Half of the houses have a living area between about 1430 sqft and 2550 sqft (the first and third quartiles), with a median of 1910 sqft and a mean of 2080 sqft.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 520 5040 7618 15110 10690 1651000
Again, we applied a logarithmic transformation to the lot area. Half of the houses have a lot area between 5040 sqft and 10690 sqft (the first and third quartiles), with a median of 7618 sqft and a mean of 15110 sqft.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1900 1951 1975 1971 1997 2015
Interestingly, this plot to some degree reflects the economic situation in the country. For example, we can see the lowest values around 1930 (Great Depression) and a drop around 2008 (Financial Crisis). The significantly low value for 2015 can be explained by the fact that the data was collected from May 2014 to May 2015. Half of the houses were built between 1951 and 1997 (the first and third quartiles), with a median of 1975.
##
## 0 1934 1940 1944 1945 1946 1948 1950 1951 1953 1954 1955
## 20699 1 2 1 3 2 1 2 1 3 1 3
## 1956 1957 1958 1959 1960 1962 1963 1964 1965 1967 1968 1969
## 3 3 5 1 4 2 4 5 5 2 8 4
## 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981
## 9 2 4 5 3 6 3 8 6 10 11 5
## 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993
## 11 18 18 17 17 18 15 22 25 20 17 19
## 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005
## 19 16 15 15 19 17 35 19 22 36 26 35
## 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
## 24 35 18 22 18 13 11 37 91 16
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1934 1987 2000 1996 2007 2015
I excluded houses that were never renovated from the second plot in order to zoom in on the data around 2000. We can see that a lot of renovations were made in 2014, which is understandable: owners renovated their old houses right before the sale. Among renovated houses, half were renovated between 1987 and 2007 (the first and third quartiles), with a median of 2000 and a mean of 1996.
There are 21,613 houses in the dataset with 11 features (date, price, bedrooms, bathrooms, sqft_living, sqft_lot, waterfront, view, condition, yr_built, yr_renovated). The variables waterfront, view, and condition are ordered factor variables with the following levels.
(worst) —————-> (best)
waterfront: 0, 1
view: 0, 1, 2, 3, 4
condition: 1, 2, 3, 4, 5
Other observations:
The logarithmically transformed price distribution appears to be approximately normal, with the price peaking around 450 thousand dollars.
Most houses do not overlook the waterfront and have a view graded as zero.
Condition is skewed to the right, with most houses in condition 3 or better.
All houses have a whole number of bedrooms, and houses with 3 bedrooms are the most common.
A lot of renovations were made in 2014. Among renovated houses, half were renovated between 1987 and 2007.
The main feature of interest in the dataset is price. I’d like to determine which features are best for predicting the price of a house. I suspect some combination of the variables can be used to build a predictive model to price houses.
View, condition, bedrooms, sqft_living, sqft_lot and yr_built are likely to contribute to the price of a house. After researching information on house prices, I think sqft_living, condition and yr_built contribute most to the price. One of the most important things in house pricing is location, but I decided to exclude this variable as all the data is from the same county.
I created two variables for the day and month of purchase using the information from date. I’ve heard that prices are lower during winter, so it was particularly interesting for me to check this.
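For reference, here is one way the day and month variables could be derived from the raw date strings. This is a sketch of the idea, not necessarily the code used for this report; the three input values are copied from the outputs above.

```r
# Three raw values in the format of the `date` column shown above.
raw_date <- c("20140502T000000", "20140612T000000", "20150217T000000")

# The first eight characters hold the calendar date as YYYYMMDD.
sale_date <- as.Date(substr(raw_date, 1, 8), format = "%Y%m%d")

month_num <- as.integer(format(sale_date, "%m"))
day_num   <- as.integer(format(sale_date, "%d"))

# Ordered factors, so that plots and summaries keep calendar order.
month <- factor(month_num, levels = 1:12, ordered = TRUE)
day   <- factor(day_num, levels = 1:31, ordered = TRUE)

# "14 May"-style label; month.abb avoids locale-dependent month names.
formatted_date <- paste(format(sale_date, "%y"), month.abb[month_num])
```

Using `month.abb` instead of `format(sale_date, "%b")` keeps the labels in English regardless of the system locale.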
I log-transformed the right-skewed price, sqft_living and sqft_lot distributions.
The numbers of bedrooms and bathrooms tend to correlate with each other. The greater the number of bathrooms or bedrooms, the larger the living area of the house. Price correlates with living area and with the numbers of bathrooms and bedrooms.
From this data, sqft_lot, yr_built and yr_renovated do not seem to have significant correlations with price, but the year when a house was built moderately correlates with the number of bathrooms and living area. I want to look closer at scatter plots involving price and some other variables like sqft_living, sqft_lot, bedrooms, bathrooms, yr_built, yr_renovated.
As living area increases, the variance in price increases. Most houses form a dense cloud below a price of 700 thousand. The relationship between price and sqft_living appears to be linear.
There is not much correlation between price and sqft_lot, but we can see some vertical bands where many houses with the same lot area have different price points. These bands could be explained by lot areas being estimated as round values.
##
## Pearson's product-moment correlation
##
## data: bedrooms and price_thousands
## t = 47.651, df = 21611, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2962354 0.3203646
## sample estimates:
## cor
## 0.3083496
Again, the tall vertical strips reflect the fact that bedroom counts are integers. Adding jitter, transparency, and changing the plot limits lets us see the slight correlation between bedrooms and price.
##
## Pearson's product-moment correlation
##
## data: bathrooms and price_thousands
## t = 90.714, df = 21611, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5154140 0.5347258
## sample estimates:
## cor
## 0.5251375
Again, the tall vertical strips reflect the fact that bathroom counts are multiples of 0.25. Adding jitter, transparency, and changing the plot limits lets us see the correlation between bathrooms and price.
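The t statistics in the two Pearson outputs above follow directly from the correlation and the degrees of freedom, via t = r * sqrt(df / (1 - r^2)). A short worked check, where `t_from_r` is just an illustrative helper name:

```r
# t statistic of a Pearson correlation r with df = n - 2 degrees of freedom.
t_from_r <- function(r, df) r * sqrt(df / (1 - r^2))

# Values of r and df copied from the two cor.test outputs above.
t_bedrooms  <- t_from_r(0.3083496, 21611)
t_bathrooms <- t_from_r(0.5251375, 21611)

# Squared correlation: share of price variance a straight line
# through each single predictor would explain.
r2_bedrooms  <- 0.3083496^2
r2_bathrooms <- 0.5251375^2
```

With more than 21,000 observations even a modest correlation gives a huge t value, so the tiny p-values say little about strength. The squared correlations (about 0.10 for bedrooms and 0.28 for bathrooms) are more informative here.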
We can see two outliers with a price over 4000 thousand built before 1950, and significantly more houses with an extremely high price built after 1975. Adding transparency and adjusting limits didn’t reveal any strong patterns. We can see that the median and mean stay approximately the same across yr_built. What is clear is that most houses were built after 1950, as there are more points there.
Among renovated houses we can see a weak correlation between price and yr_renovated. It is also clear from this plot that most houses were renovated recently.
Next, I’ll look at how the categorical features vary with sqft_living and price.
In general, houses with a waterfront seem to have a larger living area.
##
## Spearman's rank correlation rho
##
## data: price_thousands and as.numeric(waterfront)
## S = 1.489e+12, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.1150893
Houses with a waterfront seem to have a higher price, but we should also remember that this could partially be the result of a larger living area. The variation of price is higher for houses with a waterfront. Spearman’s coefficient shows us that the correlation is very weak.
There is a tendency for houses with a better view to have a larger living area.
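A sketch of the Spearman test on an ordered factor, using synthetic data (not the real houses) in which roughly 5% of houses have a waterfront and carry a price premium. The numbers are made up for illustration only.

```r
set.seed(42)
n <- 500

# Synthetic data: rare waterfront houses with a price premium.
waterfront <- factor(rbinom(n, 1, 0.05), levels = c(0, 1), ordered = TRUE)
price_thousands <- rlnorm(n, log(450), 0.5) * ifelse(waterfront == "1", 2.5, 1)

# exact = FALSE avoids the warning about ties, which a two-level
# factor necessarily produces.
rho_test <- cor.test(price_thousands, as.numeric(waterfront),
                     method = "spearman", exact = FALSE)
rho_test$estimate

# Comparing medians by group shows the size of the gap directly.
tapply(price_thousands, waterfront, median)
```

When one group is very rare, the rank correlation stays small even if the gap between the groups is large, so a weak rho does not by itself mean that waterfront has little effect on price.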
##
## Spearman's rank correlation rho
##
## data: price_thousands and as.numeric(view)
## S = 1.1881e+12, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.2939312
## Df Sum Sq Mean Sq F value Pr(>F)
## view 4 4.901e+08 122519732 1093 <2e-16 ***
## Residuals 21608 2.423e+09 112127
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Pairwise comparisons using t tests with pooled SD
##
## data: price_thousands and view
##
## 0 1 2 3
## 1 < 2e-16 - - -
## 2 < 2e-16 0.35 - -
## 3 < 2e-16 2.8e-11 < 2e-16 -
## 4 < 2e-16 < 2e-16 < 2e-16 < 2e-16
##
## P value adjustment method: holm
Houses with a better view seem to have a higher price, but then again, we shouldn’t forget about the possible effect of a larger living area. The variation of price tends to increase with better levels of view. Spearman’s rank coefficient shows a weak correlation. All pairs of view levels, except 1 and 2, appear to be significantly different at alpha = .05.
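The ANOVA table and the Holm-adjusted pairwise comparisons above can be produced with `aov` and `pairwise.t.test`. The sketch below runs them on synthetic prices for five view levels; the group means are invented, with levels 1 and 2 given the same mean on purpose.

```r
set.seed(7)

# Synthetic prices (thousands) for five view levels; NOT the real data.
view <- factor(rep(0:4, each = 60), levels = 0:4, ordered = TRUE)
group_mean <- c(500, 800, 800, 1000, 1500)[as.integer(view)]
price_thousands <- group_mean + rnorm(length(view), sd = 300)

# One-way ANOVA: does mean price differ across view levels at all?
fit <- aov(price_thousands ~ view)
summary(fit)

# Pairwise t tests with pooled SD and Holm adjustment.
pw <- pairwise.t.test(price_thousands, view, p.adjust.method = "holm")
pw$p.value

# Pairs whose adjusted p-value falls below alpha = 0.05.
significant <- pw$p.value < 0.05
```

The p-value matrix is lower triangular: rows are levels 1 to 4, columns are levels 0 to 3, and the unused upper cells are `NA`.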
## condition: 1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 290 820 1000 1216 1490 3080
## --------------------------------------------------------
## condition: 2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 390 975 1320 1410 1700 5440
## --------------------------------------------------------
## condition: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 380 1460 1970 2149 2660 13540
## --------------------------------------------------------
## condition: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 390 1370 1820 1951 2350 12050
## --------------------------------------------------------
## condition: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 370 1430 1880 2023 2440 7710
It seems that houses in poor condition (below 3) have a smaller living area, while for the other conditions the sqft_living distributions are similar, although condition 3 has the highest median and the highest 25% and 75% quantiles.
## condition: 1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 78.0 160.0 262.5 334.4 431.1 1500.0
## --------------------------------------------------------
## condition: 2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 80.0 189.8 279.0 327.3 397.3 2555.0
## --------------------------------------------------------
## condition: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 75.0 329.5 450.0 542.0 640.0 7062.0
## --------------------------------------------------------
## condition: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 89.0 305.0 440.0 521.2 625.0 7700.0
## --------------------------------------------------------
## condition: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 110.0 350.0 526.0 612.4 725.0 3650.0
##
## Spearman's rank correlation rho
##
## data: price_thousands and as.numeric(condition)
## S = 1.6515e+12, p-value = 0.006561
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.01848996
## Df Sum Sq Mean Sq F value Pr(>F)
## condition 4 2.003e+07 5008662 37.41 <2e-16 ***
## Residuals 21608 2.893e+09 133880
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Pairwise comparisons using t tests with pooled SD
##
## data: price_thousands and condition
##
## 1 2 3 4
## 2 0.92139 - - -
## 3 0.00573 1.7e-13 - -
## 4 0.01060 4.6e-11 0.00120 -
## 5 0.00019 < 2e-16 4.8e-13 < 2e-16
##
## P value adjustment method: holm
Here we see a picture similar to the one for living area. The first two conditions have a lower price, but the other conditions are not very different, although excellent condition (5) has the highest quartiles for price. Spearman’s rank coefficient shows a very weak correlation. All pairs of conditions, except 1 and 2, appear to be significantly different at alpha = .05.
It seems that living area has a similar distribution across months, but let’s zoom in.
## formatted_date: 14 May
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 430 1430 1945 2115 2600 13540
## --------------------------------------------------------
## formatted_date: 14 Jun
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 380 1470 1980 2133 2610 10040
## --------------------------------------------------------
## formatted_date: 14 Jul
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 370 1460 1950 2128 2600 8020
## --------------------------------------------------------
## formatted_date: 14 Aug
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 480 1440 1940 2085 2540 9200
## --------------------------------------------------------
## formatted_date: 14 Sep
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 290 1440 1920 2077 2540 9890
## --------------------------------------------------------
## formatted_date: 14 Oct
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 384 1410 1900 2084 2550 12050
## --------------------------------------------------------
## formatted_date: 14 Nov
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 460 1390 1870 2049 2530 7620
## --------------------------------------------------------
## formatted_date: 14 Dec
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 520 1440 1900 2091 2560 7400
## --------------------------------------------------------
## formatted_date: 15 Jan
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 410 1440 1890 2085 2570 7880
## --------------------------------------------------------
## formatted_date: 15 Feb
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 470 1380 1830 2011 2460 7220
## --------------------------------------------------------
## formatted_date: 15 Mar
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 420 1380 1870 2022 2485 6810
## --------------------------------------------------------
## formatted_date: 15 Apr
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 520 1420 1900 2060 2510 8000
## --------------------------------------------------------
## formatted_date: 15 May
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 390 1360 1845 2024 2458 7440
The highest median living area is in June 2014 (1980 sqft) and the lowest is in February 2015 (1830 sqft). But a difference of 150 sqft is not much for a house, so I would say that living area is similar across months.
Let’s zoom in
## formatted_date: 14 May
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 78.0 326.0 465.0 548.1 650.9 3710.0
## --------------------------------------------------------
## formatted_date: 14 Jun
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 92 330 465 558 665 7062
## --------------------------------------------------------
## formatted_date: 14 Jul
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 86.5 325.0 465.0 544.8 653.9 3800.0
## --------------------------------------------------------
## formatted_date: 14 Aug
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 107.0 324.6 442.1 536.4 640.0 5570.0
## --------------------------------------------------------
## formatted_date: 14 Sep
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 85.0 325.0 450.0 529.3 624.4 6885.0
## --------------------------------------------------------
## formatted_date: 14 Oct
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 89.0 315.0 446.9 539.0 640.0 7700.0
## --------------------------------------------------------
## formatted_date: 14 Nov
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 82.0 308.6 435.0 522.0 616.0 3850.0
## --------------------------------------------------------
## formatted_date: 14 Dec
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 95.0 312.2 432.5 524.5 626.6 3300.0
## --------------------------------------------------------
## formatted_date: 15 Jan
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 99.0 310.2 438.5 525.9 625.0 3567.0
## --------------------------------------------------------
## formatted_date: 15 Feb
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 75.0 309.0 425.5 507.9 600.0 3065.0
## --------------------------------------------------------
## formatted_date: 15 Mar
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 81.0 319.9 450.0 544.0 655.0 3395.0
## --------------------------------------------------------
## formatted_date: 15 Apr
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 84.0 338.4 476.5 561.8 668.2 5350.0
## --------------------------------------------------------
## formatted_date: 15 May
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 95.0 342.1 455.0 558.1 625.0 4208.0
##
## Spearman's rank correlation rho
##
## data: price_thousands and as.numeric(month)
## S = 1.7087e+12, p-value = 0.02273
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.01549432
Here the highest median price is in April 2015 (476.5 thousand), followed by May, June and July 2014 (465 thousand each), and the lowest median price is in February 2015 (425.5 thousand). In general there is a kind of peak in spring and summer and a slight price drop in winter. So, I guess, seasons affect the price very slightly, if at all. Spearman’s rank coefficient shows a very weak correlation.
Price correlates strongly with living area and the number of bathrooms. Price also correlates slightly with the number of bedrooms. I found it interesting that the number of bathrooms correlates with price more strongly than the number of bedrooms does.
As living area increases, the variance in price increases. In the plot of price vs sqft_living, most houses form a dense cloud below a price of 700 thousand. The relationship between price and sqft_living appears to be exponential rather than linear.
From this data, sqft_lot, yr_built and yr_renovated do not seem to have significant correlations with price, but the year when a house was built moderately correlates with the number of bathrooms and living area.
Houses overlooking the waterfront have a much higher median price than houses without a waterfront. Price variance is also higher for houses with a waterfront.
Houses with a better view tend to have higher prices. For condition, houses in levels 1 and 2 tend to have lower prices, while prices for levels 3 to 5 are similar.
Houses in condition 3 out of 5 have the highest median living area. The median price for condition 4 is lower than for condition 3, which is interesting.
The variation of price tends to increase with better levels of view and then decrease for an excellent view (4).
The highest median price is in April 2015, followed by May to July 2014, and the lowest median price is in February 2015. The median price is typically lower in winter and higher in spring and summer.
The numbers of bedrooms and bathrooms tend to correlate with each other. The greater the number of bathrooms or bedrooms, the larger the living area of the house, which makes sense.
The price of a house is positively and strongly correlated with living area. The numbers of bedrooms and bathrooms correlate with the price, but less strongly than living area. The variable sqft_living could be used in a model to make house price predictions. But then we shouldn’t use bathrooms and bedrooms, because they measure the same quality and correlate with sqft_living.
Let’s now take a look at price per sqft of living area in order to remove the possible influence of sqft_living on the price.
## waterfront: 0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 87.59 181.80 243.90 262.30 316.70 810.10
## --------------------------------------------------------
## waterfront: 1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 117.7 365.8 519.1 508.1 646.0 800.0
The median price per sqft for houses with a waterfront is more than twice the median for houses without one (519.1 vs 243.9). The variance of price/sqft is also higher for houses with a waterfront, although the minimums and maximums of the two groups are close.
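A minimal sketch of the price-per-sqft comparison. The five houses below are made up for illustration; only the approach (divide price by living area, then summarise by group) mirrors the output above.

```r
# Tiny made-up data frame; the values are illustrative only.
houses <- data.frame(
  price       = c(300000, 450000, 600000, 1200000, 1500000),
  sqft_living = c(1500, 2000, 2400, 2400, 3000),
  waterfront  = factor(c(0, 0, 0, 1, 1), levels = c(0, 1), ordered = TRUE)
)

# Price per square foot removes most of the size effect from the comparison.
houses$price_per_sqft <- houses$price / houses$sqft_living

# Median price per sqft within each waterfront group.
median_by_group <- tapply(houses$price_per_sqft, houses$waterfront, median)
ratio <- unname(median_by_group["1"] / median_by_group["0"])
```

For the real data the same ratio, taken from the medians printed above, is 519.1 / 243.9, or roughly 2.13.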
## view: 0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 87.59 178.90 239.30 256.90 309.10 810.10
## --------------------------------------------------------
## view: 1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 103.6 227.8 297.9 319.9 384.7 727.1
## --------------------------------------------------------
## view: 2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 89.2 215.2 285.2 304.3 363.0 756.6
## --------------------------------------------------------
## view: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 97.04 227.90 298.40 322.50 391.10 758.40
## --------------------------------------------------------
## view: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 126.4 309.1 406.0 434.8 564.1 800.0
## Df Sum Sq Mean Sq F value Pr(>F)
## view 4 14629168 3657292 319.7 <2e-16 ***
## Residuals 21608 247165160 11439
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Pairwise comparisons using t tests with pooled SD
##
## data: price/sqft_living and view
##
## 0 1 2 3
## 1 <2e-16 - - -
## 2 <2e-16 0.0431 - -
## 3 <2e-16 0.7355 0.0057 -
## 4 <2e-16 <2e-16 <2e-16 <2e-16
##
## P value adjustment method: holm
The picture is now not as monotonic as it was for the full price. The median price/sqft for view 2 breaks the pattern, as it’s lower than the median for the worse view 1. Only the pair of views 1 and 3 appears not to be significantly different at alpha = .05.
## condition: 1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 95.38 149.40 246.80 290.40 393.30 785.30
## --------------------------------------------------------
## condition: 2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 88.79 156.20 210.60 244.60 304.70 792.70
## --------------------------------------------------------
## condition: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 87.59 179.80 239.30 257.40 305.00 810.10
## --------------------------------------------------------
## condition: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 88.08 181.60 250.00 270.90 329.40 792.10
## --------------------------------------------------------
## condition: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 88.0 212.3 284.7 299.0 363.9 758.4
## Df Sum Sq Mean Sq F value Pr(>F)
## condition 4 3046131 761533 63.59 <2e-16 ***
## Residuals 21608 258748196 11975
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Pairwise comparisons using t tests with pooled SD
##
## data: price/sqft_living and condition
##
## 1 2 3 4
## 2 0.173 - - -
## 3 0.397 0.397 - -
## 4 0.661 0.012 4.0e-14 -
## 5 0.669 3.8e-09 < 2e-16 < 2e-16
##
## P value adjustment method: holm
From condition 2 upward, better conditions have a higher median price per sqft. But condition 1 has a median price/sqft higher than conditions 2 and 3. Price/sqft also has the largest variance for houses in condition 1. Condition 1 does not appear to be significantly different from any other condition at alpha = .05, and neither does the pair 2 and 3.
Here we can see a bump in the median price/sqft in April and then a slight decrease from November to January. Variance in each month is approximately the same.
Most houses are in condition 3 or higher. Interestingly, almost all houses built after 2000 seem to be in condition 3. The pattern holds across each level of view and each level of waterfront.
View and waterfront are not correlated with yr_built; nothing particularly stands out.
Condition, view and waterfront are not correlated with yr_renovated; nothing particularly stands out.
If we hold sqft_living constant, a better view goes with a higher-priced house.
The pattern is not noticeable here.
The last 3 plots suggest that we can build a linear model and use those variables in the linear model to predict the price of a house.
##
## Calls:
## m0: lm(formula = I(price_thousands) ~ I(sqft_living), data = kc_house)
## m1: lm(formula = I(log10(price_thousands)) ~ I((sqft_living)^(1/2)),
## data = kc_house)
##
## ======================================================
## m0 m1
## ------------------------------------------------------
## (Intercept) -43.581*** 1.925***
## (4.403) (0.005)
## I(sqft_living) 0.281***
## (0.002)
## I((sqft_living)^(1/2)) 0.017***
## (0.000)
## ------------------------------------------------------
## R-squared 0.493 0.481
## adj. R-squared 0.493 0.481
## sigma 261.453 0.165
## F 21001.910 20007.959
## p 0.000 0.000
## Log-likelihood -150969.968 8298.252
## Deviance 1477276362.322 587.149
## AIC 301945.937 -16590.505
## BIC 301969.880 -16566.562
## N 21613 21613
## ======================================================
Model m0 has a slightly higher R-squared (0.493 vs 0.481), although the two values are not directly comparable because the models use different response scales. Let’s look at the residual plots for these two models.
From the residual plot for m0 we can see that the residuals contain predictive information: with a higher fitted value, the residuals tend to have a larger absolute value. For model m1 the residuals look more random, so we can say that model m1 has a better fit. Let’s find out how well the models fit with additional factors.
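The residual comparison can also be summarised numerically. The sketch below uses synthetic data built so that log10(price) is linear in sqrt(sqft_living); the coefficients echo m1 above, but the data frame `kc_fake` and the helper `spread_trend` are illustrative assumptions, not part of the original analysis.

```r
set.seed(3)
n <- 2000

# Synthetic houses: log10(price) is linear in sqrt(sqft_living) plus noise.
sqft_living <- round(runif(n, 400, 6000))
price_thousands <- 10^(1.925 + 0.017 * sqrt(sqft_living) + rnorm(n, sd = 0.165))
kc_fake <- data.frame(sqft_living, price_thousands)

m0 <- lm(price_thousands ~ sqft_living, data = kc_fake)
m1 <- lm(log10(price_thousands) ~ sqrt(sqft_living), data = kc_fake)

# If residual spread grows with the fitted value, |residual| correlates
# with the fitted value; a value near zero suggests constant spread.
spread_trend <- function(model) cor(abs(resid(model)), fitted(model))
trend_m0 <- spread_trend(m0)
trend_m1 <- spread_trend(m1)

# Residual-versus-fitted plots for visual inspection:
# plot(fitted(m0), resid(m0)); plot(fitted(m1), resid(m1))
```

On data like this, the raw-scale model shows a clear upward trend in residual spread, while the transformed model does not, which is the pattern described for the real residual plots.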
##
## Calls:
## m1: lm(formula = I(log10(price_thousands)) ~ I((sqft_living)^(1/2)),
## data = kc_house)
## m2: lm(formula = I(log10(price_thousands)) ~ I((sqft_living)^(1/2)) +
## view, data = kc_house)
## m3: lm(formula = I(log10(price_thousands)) ~ I((sqft_living)^(1/2)) +
## view + condition, data = kc_house)
## m4: lm(formula = I(log10(price_thousands)) ~ I((sqft_living)^(1/2)) +
## view + condition + waterfront, data = kc_house)
##
## ==========================================================================
## m1 m2 m3 m4
## --------------------------------------------------------------------------
## (Intercept) 1.925*** 2.079*** 2.051*** 2.106***
## (0.005) (0.007) (0.009) (0.011)
## I((sqft_living)^(1/2)) 0.017*** 0.015*** 0.016*** 0.016***
## (0.000) (0.000) (0.000) (0.000)
## view: .L 0.160*** 0.158*** 0.119***
## (0.007) (0.007) (0.008)
## view: .Q 0.021*** 0.022*** -0.008
## (0.006) (0.006) (0.007)
## view: .C 0.073*** 0.072*** 0.056***
## (0.008) (0.008) (0.008)
## view: ^4 -0.017* -0.017** -0.022***
## (0.007) (0.007) (0.007)
## condition: .L 0.115*** 0.116***
## (0.019) (0.019)
## condition: .Q 0.009 0.009
## (0.016) (0.016)
## condition: .C -0.016 -0.016
## (0.012) (0.012)
## condition: ^4 0.033*** 0.033***
## (0.007) (0.007)
## waterfront: .L 0.098***
## (0.011)
## --------------------------------------------------------------------------
## R-squared 0.481 0.511 0.517 0.518
## adj. R-squared 0.481 0.511 0.516 0.518
## sigma 0.165 0.160 0.159 0.159
## F 20007.959 4511.043 2565.894 2325.900
## p 0.000 0.000 0.000 0.000
## Log-likelihood 8298.252 8941.226 9073.073 9113.383
## Deviance 587.149 553.233 546.525 544.490
## AIC -16590.505 -17868.453 -18124.146 -18202.765
## BIC -16566.562 -17812.585 -18036.354 -18106.993
## N 21613 21613 21613 21613
## ==========================================================================
Adding new factors improved the model a little. But we can see that some contrast terms for condition are not significant and adding waterfront improves R-squared by only 0.001. This isn’t surprising, considering the absence of a pattern in the plots. The best version of the model accounts for 51.8% of the variance in the log-transformed price of houses, which isn’t much. I guess some important factors are missing from this data.
For conditions 2-5, better conditions have a higher median price per sqft. But condition 1 has a median price/sqft higher than conditions 2 and 3. Price/sqft also has the largest variance for houses in condition 1.
The median price/sqft has a bump in April and then slightly decreases from November to January. Variance in each month is approximately the same. So I guess it’s true that winter is the most profitable time to buy a property.
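The `.L`, `.Q`, `.C` and `^4` rows in the table come from the polynomial contrasts that R applies to ordered factors by default. A small sketch on synthetic data (the data frame `fake` and its coefficients are invented) shows where those terms come from and how nested models can be compared:

```r
set.seed(11)
n <- 1000

# Synthetic data; view is an ordered factor as in the report.
sqft_living <- round(runif(n, 400, 6000))
view <- factor(sample(0:4, n, replace = TRUE), levels = 0:4, ordered = TRUE)
log_price <- 1.9 + 0.017 * sqrt(sqft_living) +
  0.03 * (as.integer(view) - 1) + rnorm(n, sd = 0.16)
fake <- data.frame(sqft_living, view, log_price)

m1 <- lm(log_price ~ sqrt(sqft_living), data = fake)
m2 <- lm(log_price ~ sqrt(sqft_living) + view, data = fake)

# Ordered factors use polynomial contrasts by default, which is where the
# .L (linear), .Q (quadratic), .C (cubic) and ^4 terms come from.
names(coef(m2))

# R-squared cannot decrease when a predictor is added, so adjusted
# R-squared or an F test is a fairer basis for comparing nested models.
r2 <- c(m1 = summary(m1)$r.squared, m2 = summary(m2)$r.squared)
adj_r2 <- c(m1 = summary(m1)$adj.r.squared, m2 = summary(m2)$adj.r.squared)
anova(m1, m2)
```

Because the terms are trend components rather than individual levels, a non-significant `.Q` or `.C` term means that the corresponding curvature is not detectable, not that a particular level is unimportant.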
The last three plots from the Multivariate section suggest that I can build a linear model and use those variables in the model to predict the price of a house. The results of the model are summarized below.
I was surprised that almost all houses built after 2000 seem to be in condition 3. I expected to see new houses in excellent condition, but the data showed another pattern. This pattern holds across each level of view and each level of waterfront.
Yes, I created a linear model starting from price and sqft_living. The variables in the linear model account for 51.8% of the variance in house prices. Adding the view variable to the model improves the R^2 value by three percentage points (from 0.481 to 0.511), which is expected based on the visualization above of price vs sqft_living and view.
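As a rough worked example, the m1 coefficients from the table can be turned into a price estimate. This assumes the response is log10 of price in thousands of dollars, which the table does not state; the base and unit are inferred from the size of the intercept. The living area of 1,960 sqft is an illustrative value, and the coefficients are rounded to three decimals, so the result is approximate.

```latex
\begin{align*}
\widehat{\log_{10}(\text{price\_thousands})}
  &= 1.925 + 0.017\,\sqrt{\text{sqft\_living}} \\
  &= 1.925 + 0.017 \times \sqrt{1960} \\
  &\approx 1.925 + 0.017 \times 44.27 \approx 2.678 \\
\widehat{\text{price\_thousands}}
  &\approx 10^{2.678} \approx 476 \\[6pt]
\Delta R^2_{\,m1 \to m2} &= 0.511 - 0.481 = 0.030
\end{align*}
```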
The distribution of house prices appears to be unimodal on a log scale, with a peak around 450 thousand dollars. Perhaps most people are looking for houses near this price point.
Data points are shown in black, boxplots in blue, and means in red. We can see that the median and mean price/sqft rise from January 2015 to May 2015. There is also a slight decrease from November 2014 to January 2015. The variance in each month is approximately the same. So from this picture it seems that winter is the most profitable time to buy a house.
The plot indicates that a linear model could be constructed to predict house prices, using price as the outcome variable and living area as the predictor variable. Holding living area constant, houses with lower view levels (0 is worst and 4 is best) are often cheaper than houses with a better view, so view can account for additional variability in prices.
The analyzed part of the King County houses data set contains information on almost 22,000 houses across 10 variables, sold between May 2014 and May 2015. I started by understanding the distributions of individual variables in the data set, and then I explored interesting questions and leads as I continued to make observations on plots. Eventually, I explored the price of houses across many variables and created a linear model to predict house prices.
There was a clear trend between the living area of a house and its price. I was surprised that the year a house was built didn’t have a strong positive correlation with price. But there was a relationship with the categorical variables view, condition and waterfront. I struggled to understand why almost all houses built after 2000 had medium condition, but this became clearer when I realized that most houses in the data are in medium condition. For the linear model, all houses were included, since information on price, view, condition and waterfront was available for all of them.
The constructed model was able to account for only 51.8% of the variance in house prices. My thought here is that the dataset didn’t contain information about important factors that affect house prices. Perhaps it wasn’t wise to exclude zipcode from consideration, as even within the same county prices can depend on the neighborhood. Proximity to highways or public transportation could also affect the price.
Moreover, the true relationship could be much more complicated than a linear one. I tried taking the log of price (because price distributions often have a long tail) and the square root of living area (as area is effectively the product of two dimensions). Even though the model with these variables without transformation showed a better R-squared, I decided to keep the model with the transformations, as it showed a better fit on the residual plot.
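The trade-off described above can be illustrated with a small sketch. R-squared values from models with different responses (price versus log price) are not directly comparable, which is one reason to look at residuals instead. Everything here is an assumption for illustration: the data is synthetic with multiplicative noise, and `spread_ratio` is a hypothetical helper, not something from the report.

```r
# Synthetic data only: prices are generated so that the noise is multiplicative.
set.seed(1)
n <- 2000
sqft_living <- runif(n, 500, 5000)
price_thousands <- 10^(1.9 + 0.017 * sqrt(sqft_living) + rnorm(n, sd = 0.16))

# Untransformed model vs. the transformed form used in the report.
raw_model <- lm(price_thousands ~ sqft_living)
transformed_model <- lm(log10(price_thousands) ~ I(sqft_living^(1/2)))

# Ratio of residual spread in the upper vs. lower half of fitted values.
# A value near 1 suggests roughly constant residual variance.
spread_ratio <- function(model) {
  res <- resid(model)
  upper <- fitted(model) > median(fitted(model))
  sd(res[upper]) / sd(res[!upper])
}

ratios <- c(raw = spread_ratio(raw_model),
            transformed = spread_ratio(transformed_model))
print(round(ratios, 2))
```

The same check can be done visually with `plot(fitted(model), resid(model))`: a funnel shape in the untransformed model indicates that the residual spread grows with the predicted price.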
Anyone who decides to use this model to predict house prices should keep some limitations in mind. This dataset represents houses sold between May 2014 and May 2015, and prices for the same houses will very likely be higher now due to changes in supply and demand or inflation. A more recent dataset would be better for predicting house prices, and it could also be interesting to analyze more variables that might affect the price.